Summary: Existing state-of-the-art methods that take a single RNA sequence and predict the
corresponding RNA secondary structure are thermodynamic methods. These predict the most stable
RNA structure, but do not consider the process of structure formation. There is by now ample
experimental and theoretical evidence, however, that sequences in vivo fold while being
transcribed and that the process of structure formation matters. We here present a conceptually
new method for predicting RNA secondary structure, called CoFold, that combines thermodynamic
with kinetic considerations. Our method significantly improves on the state of the art in terms
of prediction accuracy, especially for long sequences of more than a thousand nucleotides such
as ribosomal RNAs.

Introduction: The primary products of almost all genomes are transcripts, i.e. RNA sequences.
Their expression is often regulated by RNA structure, which forms when the transcript
interacts with itself via hydrogen bonds between complementary nucleotides (G-C, A-U, G-U).
These structures regulate translation, transcription, splicing, RNA editing and transcript
degradation. To assign a potential functional role to a transcript, it often suffices to know
its RNA secondary structure, i.e. the set of base pairs. As entire transcriptomes are now
routinely sequenced, computational methods that predict the RNA secondary structure of
individual input RNA sequences play a key role in annotating new transcripts. This is
emphasised by the fact that the majority of mammalian genomes is transcribed into transcripts
of unknown function and that experimental techniques for RNA structure determination such as
X-ray crystallography and NMR remain costly and slow.

More than three decades of research have been invested into devising methods that take a
single RNA sequence and predict its RNA secondary structure. When homologous sequences from
related species are scarce or not available, non-comparative methods such as RNAfold and Mfold
provide the state of the art in terms of prediction accuracy. They employ an optimisation
strategy that searches the space of potential secondary structures for the most stable
structure and depend on hundreds of free energy parameters that were initially determined
experimentally and later tweaked computationally. Recent attempts at replacing these
thermodynamic parameters by probabilistic ones have led to a similar or slightly improved
prediction accuracy. All non-comparative thermodynamic methods, however, show a marked drop
in prediction accuracy with increasing sequence length.

Key experiments [8, 9, 10] from the early 1980s show that structure formation happens
co-transcriptionally, i.e. while the RNA is being transcribed. Many experiments [11-19]
have since substantiated this view. In 1996, Morgan and Higgs [20] studied the discrepancies
between the conserved RNA secondary structures and the corresponding predicted minimum free
energy (MFE) structures for long RNA sequences and concluded that these differences "cannot
simply be put down to errors in the free energy parameters used in the model". They
hypothesised that these differences may be due to effects of kinetic folding. These results
are complemented by statistical evidence that structured transcripts encode information not
only on the functional RNA structure, but also on their co-transcriptional folding pathway.
While there is thus ample evidence that the process of structure formation matters to the
formation of the functional structure in vivo, it is ignored by thermodynamic methods for
RNA secondary structure prediction.

Several sophisticated computational methods have already been devised that explicitly mimic
co-transcriptional structure formation in vivo [22-30]. These folding pathway prediction
methods make a range of simplifying assumptions and approximations of the complex in vivo
environment. So far, these methods have only been used to study a few select and typically
short (≪ 1000 nt) sequences, and an evaluation of their prediction accuracy is currently
missing.

We here propose a conceptually new method, called CoFold, that combines the benefits of 
a deterministic, thermodynamic method with kinetic considerations that capture effects of the 
structure formation process. For this, we build upon the state-of-the-art method RNAfold
by combining its thermodynamic energy scores with a scaling function. We train the two free 
parameters of CoFold on a large and diverse data set of 248 sequences and examine the predic- 
tive power of CoFold on a non-redundant data set of 61 long sequences (> 1000 nt). CoFold 
shows a significant improvement in prediction accuracy, in particular for long RNA sequences 
such as ribosomal RNAs.

             TPR (%)   FPR (%)   PPV (%)   MCC (%)
RNAfold       46.30    0.0176     39.74     42.81
RNAfold-A     52.02    0.0160     44.76     48.17
CoFold        52.83    0.0159     45.79     49.10
CoFold-A      57.80    0.0145     50.06     53.70

Table 1. Prediction accuracy of CoFold for base pairs. The performance accuracy of
CoFold, CoFold-A, RNAfold and RNAfold-A for the long data set in terms of true
positive rate ($\mathrm{TPR} = 100 \cdot TP/(TP + FN)$), false positive rate
($\mathrm{FPR} = 100 \cdot FP/(FP + TN)$), positive predictive value
($\mathrm{PPV} = 100 \cdot TP/(TP + FP)$) and Matthews correlation coefficient
($\mathrm{MCC} = 100 \cdot (TP \cdot TN - FP \cdot FN)/\sqrt{(TP + FP) \cdot (TP + FN) \cdot (TN + FP) \cdot (TN + FN)}$),
where TP denotes the number of true positives, TN the true negatives, FP the false positives
and FN the false negatives.

Folding long RNA sequences We evaluate the prediction accuracy of CoFold by comparing
the secondary structures predicted by CoFold to the known reference secondary structures for
a test set of 61 long sequences (long data set). We compile this data set by identifying
sequences that are long (> 1000 nt), correspond to biological sequences, have reference
structures that are supported by phylogenetic evidence and are non-redundant in terms of
pairwise percent sequence identity (max 85%) and evolutionary distance (Supplementary
Information, Section 1, Table [2], Table [3] and Table [4]). These selection criteria yield
a data set of 16S ribosomal RNA (rRNA) and 23S rRNA sequences from archaea, bacteria,
eukaryotes and chloroplasts with an average length of 2397 nt (min 1245 nt, max 3578 nt).

Compared to RNAfold, which is the state-of-the-art thermodynamic RNA structure prediction
method, CoFold predicts 7% more known base pairs at 6% higher specificity, thereby
increasing the Matthews correlation coefficient (MCC) by 6% (MCC (RNAfold) = 42.81%,
MCC (CoFold) = 49.10%) (Table [1], Supplementary Information, Section 2). This improvement
in overall performance accuracy can be attributed to a simultaneous increase of the positive
predictive value (PPV) and the true positive rate (TPR) for almost all individual sequences
(Figure [1]) and a simultaneous slight decrease of the false positive rate (FPR)
(Supplementary Information, Section 2 and Figure [5]). Both RNAfold and CoFold employ the
default Turner 1999 free energy parameters. Combining CoFold with the Andronescu 2007 free
energy parameters (CoFold-A) increases the sensitivity and specificity by a further 4%
(MCC (CoFold-A) = 53.70%). Doing the same with RNAfold (RNAfold-A) also increases the
sensitivity and specificity with respect to RNAfold, but the resulting accuracy remains below
that of CoFold with the default parameters (MCC (RNAfold-A) = 48.17%, MCC (CoFold) = 49.10%).
Whereas CoFold only depends on two free parameters, the Andronescu 2007 free energy model
comprises 363 free parameters that were trained using machine learning techniques
(Supplementary Information, Section 3).
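
The gains quoted above can be checked directly against Table 1. The short sketch below copies the four rows of the table and computes differences between methods. It assumes that the quoted percentages are absolute differences in percentage points, which is an interpretation and not something the text states explicitly; the function name `gain` is purely illustrative.

```python
# Values copied from Table 1 (long data set); all numbers are percentages.
TABLE_1 = {
    "RNAfold":   {"TPR": 46.30, "FPR": 0.0176, "PPV": 39.74, "MCC": 42.81},
    "RNAfold-A": {"TPR": 52.02, "FPR": 0.0160, "PPV": 44.76, "MCC": 48.17},
    "CoFold":    {"TPR": 52.83, "FPR": 0.0159, "PPV": 45.79, "MCC": 49.10},
    "CoFold-A":  {"TPR": 57.80, "FPR": 0.0145, "PPV": 50.06, "MCC": 53.70},
}


def gain(method, baseline, metric):
    """Absolute difference (percentage points) of `method` over `baseline`.

    Assumption: the gains quoted in the text are absolute differences.
    """
    return TABLE_1[method][metric] - TABLE_1[baseline][metric]


if __name__ == "__main__":
    for metric in ("TPR", "PPV", "MCC"):
        print(metric, round(gain("CoFold", "RNAfold", metric), 2))
```

Under this reading, the differences between CoFold and RNAfold round to the 7% (TPR), 6% (PPV) and 6% (MCC) quoted in the text.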

Explicitly capturing the structure formation process In order to capture effects of
co-transcriptional folding in CoFold, we introduce a scaling function γ(d). This function
scales the nominal energy contribution of any base-pair-like interaction depending on the
distance d of the interaction partners along the sequence (Supplementary Information,
Section 3 and Figure [6]). It thereby captures that during the structure formation process,
potential pairing partners in close proximity are easier to identify than more distant ones.
This scaling amounts to a re-weighting of the structure search space that the structure
prediction algorithm explores. Rather than guiding the structure prediction solely based on
thermodynamic considerations as the state-of-the-art methods RNAfold and Mfold do, CoFold
thus combines kinetic and thermodynamic considerations.

The scaling function of CoFold depends on two free parameters, α and τ, which have a
straightforward interpretation (Supplementary Information, Section 3 and Figure [6]). Our goal
in training the two parameters was to confirm that parameter training is robust and to ensure
that CoFold can be applied across a wide range of sequence lengths.

To this end, we compiled an extended data set of 248 sequences (combined data set) which
comprises the 61 long sequences of the long data set and, in addition, 187 short sequences
(< 1000 nt length) that also correspond to biological sequences whose reference structures are
supported by phylogenetic evidence (Supplementary Information, Section 1, Table [2], Table [3]
and Table [4]). The sequences in this combined data set have an average length of 776 nt (min
110 nt, max 3578 nt). Using twenty trials of five-fold cross-validation for parameter training,
we find that the optimal prediction accuracy in terms of average MCC is obtained by a
combination of α and τ values whose strong correlation can be described by a linear function
α = a · τ + b, where a = 6.1 · 10^-4 ± 2 · 10^-5 is the slope and b = 0.105 ± 0.016 the
intercept (R^2 = 98.4%) (Supplementary Information, Section 4 and Figure [8] (left)). Our
cross-validation experiments yield optimal parameter combinations that fall within or near the
95% confidence interval around the linear fit, thus confirming the robustness of parameter
training (Supplementary Information, Figure [8] (right)). We use α = 0.50 and τ = 640 in CoFold
and CoFold-A (Supplementary Information, Figure [6]).
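
As a rough illustration of how the two parameters interact, the sketch below evaluates a distance-dependent scaling function and the linear fit reported above. The functional form gamma(d) = alpha * (exp(-d / tau) - 1) + 1 is an assumption made for this example: this section describes an exponential decay in reachability but does not state the formula. The names `gamma` and `alpha_on_fit` are hypothetical; the parameters are written `alpha` and `tau` in the code, and the numerical values (0.50, 640, slope 6.1e-4, intercept 0.105) are taken from the text.

```python
import math

ALPHA = 0.50       # first free parameter, value used in CoFold (from the text)
TAU = 640.0        # second free parameter, in nucleotides (from the text)
SLOPE = 6.1e-4     # slope of the reported linear fit
INTERCEPT = 0.105  # intercept of the reported linear fit


def gamma(d, alpha=ALPHA, tau=TAU):
    """Scaling factor for an interaction spanning distance d.

    ASSUMED form: equals 1 at d = 0 and decays exponentially towards
    1 - alpha as d grows. The exact formula is not given in this section.
    """
    return alpha * (math.exp(-d / tau) - 1.0) + 1.0


def alpha_on_fit(tau, slope=SLOPE, intercept=INTERCEPT):
    """First parameter predicted by the reported linear fit for a given tau."""
    return slope * tau + intercept


if __name__ == "__main__":
    for d in (0, 100, 640, 3000):
        print(d, round(gamma(d), 4))
    print("fit at tau = 640:", round(alpha_on_fit(640.0), 4))
```

With these numbers the linear fit gives a value of about 0.495 at 640 nt, close to the 0.50 that is used, which is consistent with the chosen pair lying on or near the fitted line.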

CoFold and CoFold-A outperform RNAfold and RNAfold-A also for short sequences 
(< 1000 nt), although the improvement in terms of MCC is less pronounced than for long se- 
quences (Supplementary Information, Table [5]). RNAfold shows a slight decrease in prediction 
accuracy when used with the Andronescu 2007 parameters. The behaviour of CoFold is in line 
with our expectation that the beneficial impact of modelling co-transcriptional folding decreases 
for short sequences. 

We conclude that CoFold effectively depends only on one free parameter and that CoFold 
and CoFold-A increase the prediction accuracy for all sequence lengths, in particular for long 
sequences (> 1000 nt). 

Capturing structure formation yields improved structures of similar free energies 

In order to examine if capturing the effects of co-transcriptional folding significantly changes 
the free energies of the predicted structures, we calculated the free energies of the structures 
predicted by CoFold, CoFold-A and RNAfold-A and compared them to the free energies 
of the corresponding structures predicted by RNAfold. To ensure consistency, we used the 
Turner 1999 energy parameters to calculate the energies of all predicted RNA structures. 

The free energies of the structures predicted by CoFold for the long data set differ on
average by 2% from the free energies of the corresponding structures predicted by RNAfold,
and the distribution of relative energy differences is comparatively tight (stdev = 1.0%,
min = 0.2%, max = 4.4%) (Figure [2] and Supplementary Information, Table [6]). Combining
CoFold and RNAfold with the Andronescu 2007 energy parameters significantly increases the
average free energy difference (5% (RNAfold-A), 7% (CoFold-A)), broadens the distributions
(stdev(RNAfold-A) = 1.9%, stdev(CoFold-A) = 2.4%) and leads to higher maximum energy
differences (max(RNAfold-A) = 11.1%, max(CoFold-A) = 13.1%). For short sequences, these
differences are even more pronounced (Supplementary Information, Table [6]).

Most importantly, a large energy difference with respect to the free energy of the structure 
predicted by RNAfold does not imply an increased prediction accuracy, neither for short nor 
long sequences and for none of the prediction programs (Supplementary Information, Figure 11 
and Table [7]). 

To summarise, CoFold significantly increases the prediction accuracy without significantly 
altering the free energies of the structures that RNAfold would predict for the same input 
sequences. 

Folding ribosomal RNAs 23S ribosomal RNAs are the longest sequences of the long data
set with an average length of 3069 nt (min 2882 nt, max 3578 nt) and are thus some of the
most challenging RNA structures to predict. Using CoFold and CoFold-A, we increase their
prediction accuracy in terms of MCC w.r.t. RNAfold on average by 8% and 12%, respectively.
Figure [3] shows, for the 23S rRNA of the gamma-proteobacterium Pseudomonas aeruginosa, how
the RNA structure predicted by CoFold-A compares to that predicted by RNAfold. The
most apparent differences are that RNAfold predicts many incorrect mid- and long-range base
pairs (red arcs spanning more than 100 nt) and that almost all of these disappear with
CoFold-A. In addition, CoFold-A adds many correct mid- and long-range base pairs (blue arcs),
see in particular those spanning almost the entire sequence. Overall, CoFold-A increases the
MCC of RNAfold from 43% to 58%. This 15% rise in performance accuracy is due to a significant
increase of the true positive rate (45% → 61%) and an equally significant increase of the
positive predictive value (41% → 56%). This is in line with the typical behaviour seen for
CoFold (Figure [1] and Supplementary Information, Figure [5]). The false positive rate for
both prediction methods remains low at 0.01%.

We also investigated the performance for the 16S ribosomal RNAs in greater detail. With
an average length of 1550 nt (min 1245 nt, max 1799 nt), these are significantly shorter than
the 23S rRNAs in our long data set, but still considerably longer than the average test sequence
on which thermodynamic prediction methods are typically benchmarked. Figure [4] shows the
improvements in prediction accuracy for the 16S rRNA of the freshwater alga Cryptomonas sp.
(species unknown). This ribosomal sequence is 1493 nt long. CoFold-A improves the prediction
accuracy of RNAfold from an MCC of 32% to 73%. This 41% improvement in performance
accuracy is achieved by significantly reducing the number of erroneously predicted mid- to
long-range base pairs (red arcs spanning more than 100 nt) while simultaneously increasing the
number of correctly predicted base pairs across a wide range of distances (blue arcs). This is
reflected by the simultaneous increase of the true positive rate (33% → 77%) and the positive
predictive value (30% → 69%) which, in this example, is also accompanied by a slight reduction
of the false positive rate (0.03% → 0.01%).

As neither CoFold nor RNAfold is technically capable of predicting pseudo-knotted
features, the pseudo-knotted reference structures of the 23S rRNA and the 16S rRNA cannot
be predicted with perfect accuracy (see orange arcs in Figure [3] and Figure [4]).

Discussion Our results show that the state of the art in non-comparative RNA secondary
structure prediction can be significantly improved by capturing effects of the structure
formation process. To this end, we introduce a conceptually new RNA secondary structure
prediction method called CoFold which judges the reachability of potential pairing partners
during co-transcriptional structure formation via a scaling function. We reliably train the
two free parameters of CoFold using a large and diverse data set of 248 sequences. This
scaling function effectively depends on only one free parameter which has a straightforward
interpretation as it determines how the reachability declines as a function of the nucleotide
distance. Without altering the free energy parameters of the underlying thermodynamic model,
CoFold thereby guides the structure prediction process by a combination of thermodynamic and
kinetic considerations. It thus arrives at significantly more accurate structure predictions,
in particular for long sequences (> 1000 nt) such as ribosomal RNAs. The significance of the
improvement in prediction accuracy is underlined by the improvement in sensitivity and
specificity for the individual sequences. Most importantly, this improvement is gained without
significantly shifting the free energies of the predicted RNA structures. We thereby confirm
Morgan and Higgs [20], who hypothesised that discrepancies between the functional RNA
secondary structure and the corresponding minimum free energy structures predicted by
thermodynamic methods such as RNAfold are not due to errors of the underlying free energy
parameters, but due to a lack of modelling the effects of kinetic structure formation.

Many sophisticated experiments [8-19] paint a dauntingly complex picture of
co-transcriptional structure formation in vivo, which can depend on a multitude of factors
ranging from the speed of transcription and the variation thereof, to a range of carefully
orchestrated cis and trans interactions with a variety of other molecules.

Several sophisticated folding pathway prediction methods have already been devised that
explicitly mimic co-transcriptional structure formation in vivo [22-30]. Even though these
methods need to make a range of simplifying assumptions to approximate the complex in vivo
environment and have so far only been used to investigate a few select and typically short
(≪ 1000 nt) sequences, they have already allowed us to gain valuable and detailed insight
into co-transcriptional folding pathways [26, 31].

By proposing a conceptually new approach to RNA secondary structure prediction, CoFold, we
show that it is possible to capture effects of the structure formation process in
deterministic thermodynamic methods and that the benefits of doing so are significant, both
in terms of prediction accuracy and insight gained. This finding is not too surprising given
that any co-transcriptionally emerging RNA transcript in vivo needs to find a way to actually
reach the functionally relevant RNA structure. Although CoFold only constitutes the first
attempt at explicitly capturing the effects of co-transcriptional folding in a thermodynamic
RNA secondary structure prediction program, we hope that our results will inspire a new
generation of methods that explicitly capture aspects of the structure formation process in
vivo. Recent developments in experimental techniques [17, 19] will no doubt significantly
contribute to our understanding of structure formation in vivo. We are thus looking forward
to joint projects between the experimental and theoretical RNA structure communities.

Methods summary The algorithm of CoFold and the scoring function are described in
detail in the Supplementary Information. CoFold is publicly available on the Internet via a
web server at http://www.e-rna.org/cofold.

Figure 1. Changes in prediction accuracy for the structures predicted by CoFold
for individual sequences. We report the prediction accuracy for base pairs of the long data
set in terms of absolute changes of the true positive rate
($\mathrm{TPR} = 100 \cdot TP/(TP + FN)$) and the positive predictive value
($\mathrm{PPV} = 100 \cdot TP/(TP + FP)$) by comparing the prediction accuracy of the
structures predicted by CoFold to those predicted by RNAfold. TP denotes the number of true
positives, TN the true negatives, FP the false positives and FN the false negatives, see
Supplementary Information Section 2 for detailed definitions.

Figure 2. Relative free energy differences of the predicted structures w.r.t. the
MFE structures predicted by RNAfold. Summary of three distributions for the long
data set showing the relative free energy differences of the RNA structures predicted by
RNAfold-A w.r.t. the MFE structures predicted by RNAfold for the same sequence (left),
of the RNA structures predicted by CoFold w.r.t. the MFE structures predicted by
RNAfold (middle) and of the RNA structures predicted by CoFold-A w.r.t. the
MFE structures predicted by RNAfold-A (right). The free energies of all structures are
calculated using the Turner 1999 energy parameters. For each of the three distributions, the
dark horizontal line indicates the average, the box indicates the 1st to the 3rd quartile, and the
dotted lines indicate minimum and maximum values. Circles indicate outliers which are not
included in the calculation of the average value.







Figure 3. RNA secondary structures predicted by CoFold-A and RNAfold for the 23S rRNA
of the gamma-proteobacterium Pseudomonas aeruginosa. The horizontal line corresponds to
the RNA sequence of 2893 nt length. The structure predicted by RNAfold is shown above
the horizontal line, the one predicted by CoFold-A below. Each arc corresponds to a
base pair between the two corresponding positions along the sequence. Blue arcs correspond to
correctly predicted base pairs (true positives), red arcs to incorrectly predicted base pairs
(false positives) and black arcs to base pairs that are part of the reference structure, but
missing from the prediction (false negatives). Orange arcs indicate base pairs of the reference
structure that render it pseudo-knotted. Figure made with R-CHIE.

Figure 4. RNA secondary structures predicted by CoFold-A and RNAfold for the 16S rRNA
of the freshwater alga Cryptomonas sp. The horizontal line corresponds to the RNA sequence
of 1493 nt length. The structure predicted by RNAfold is shown above the horizontal line,
the one predicted by CoFold-A below. Each arc corresponds to a base pair between the two
corresponding positions along the sequence. Blue arcs correspond to correctly predicted base
pairs (true positives), red arcs to incorrectly predicted base pairs (false positives) and black
arcs to base pairs that are part of the reference structure, but missing from the prediction
(false negatives). Orange arcs indicate base pairs of the reference structure that render it
pseudo-knotted. Figure made with R-CHIE.






Supplementary Information 

CoFold: thermodynamic RNA structure prediction with a kinetic twist

Jeff R. Proctor and Irmtraud M. Meyer 
Centre for High-Throughput Biology & Department of Computer Science and 
Department of Medical Genetics, University of British Columbia, 
2125 East Mall, Vancouver, BC, 
Canada V6T 1Z4, irmtraud.meyer@cantab.net 

July 15, 2012 



Methods 

(1) Compilation of the long and combined data sets The long data set consists of 16S
and 23S ribosomal RNAs only. 16S and 23S multiple-sequence alignments for bacteria,
eukaryotes, archaea and chloroplasts were retrieved from the Comparative RNA Web site (CRW).
Because no consensus RNA structure is provided for each alignment, we projected the individual
structures onto the alignment. The structure with the lowest mismatch score (defined in
Equation (1)) was chosen as the consensus RNA structure for each alignment. For the
calculation of the mismatch score, base pairs with a gap in one base position and a non-gap
in the other are considered one-sided gaps. Base pairs with gaps on both sides are considered
two-sided gaps. Non-canonical pairs are those other than G-C, A-U, G-U. The length of the
alignment A is denoted by |A|.

\[
\text{mismatch} := \frac{\sum_{\text{seq} \in A} \big( (\#\,\text{one-sided gaps}) + (\#\,\text{two-sided gaps}) + (\#\,\text{non-canonical pairs}) \big)}{|A|} \tag{1}
\]
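
The mismatch score of Equation (1) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `mismatch_score`, the use of `-` as the gap character, the 0-based column indices and the reading of |A| as the number of alignment columns are all assumptions made for the example.

```python
# Canonical base pairs as defined in the text: G-C, A-U and G-U (either order).
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def mismatch_score(alignment, pairs, gap="-"):
    """Mismatch of a candidate structure projected onto an alignment.

    alignment: list of equal-length gapped RNA strings (hypothetical input format).
    pairs: list of (i, j) alignment columns (0-based) forming the candidate structure.
    ASSUMPTION: |A| is taken to be the number of alignment columns.
    """
    total = 0
    for seq in alignment:
        for i, j in pairs:
            x, y = seq[i], seq[j]
            gaps = (x == gap) + (y == gap)
            if gaps == 2:
                total += 1          # two-sided gap
            elif gaps == 1:
                total += 1          # one-sided gap
            elif (x, y) not in CANONICAL:
                total += 1          # non-canonical pair
    return total / len(alignment[0])


if __name__ == "__main__":
    aln = ["GGAACC", "G-AACC", "-AAAA-"]
    print(mismatch_score(aln, [(0, 5), (1, 4)]))
```

The candidate structure with the lowest score over all individual structures would then be taken as the consensus, as described above.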

Sequences with large indels, many ambiguous nucleotides, or a poor fit to the consensus
RNA structure were removed from the alignment. Unpaired regions of the alignment were
re-aligned using MUSCLE. Individual sequences were extracted from each resulting alignment
such that no pair of extracted sequences has a pairwise percent sequence identity greater than
an alignment-specific threshold. The exact threshold varies to ensure that no biological class
or evolutionary clade is over-represented in the long data set (max 85%, Table [3]). For each
extracted sequence, the consensus alignment structure was projected onto the sequence by
removing base pairs at gap positions, and removing any non-canonical base pairs. The resulting
61 sequences have a mean sequence length of 2397 nt and constitute the long data set
(Table [2], Table [3] and Table [4]).

The combined data set was constructed primarily for robustness of parameter training, and
contains Rfam sequences from a wide variety of biological classes. Rfam alignments were
chosen according to the following criteria:

• mean sequence length greater than 115

• covariation (defined in Equation (2)) greater than 0.18

• minimum of 5 sequences 

• mean percentage of canonical base pairs greater than 

• diverse biological classes and evolutionary clades 

Sequences were extracted from the Rfam alignments using the same protocol as for the
CRW alignments described above. Specifically, no pair of sequences extracted from the same
alignment shares a pairwise percent sequence identity above an alignment-specific threshold
(max 85%, Table [3]). Consensus RNA structures were projected onto individual sequences by
removing base pairs at gap positions, and by removing any non-canonical base pairs. The mean
sequence length of the resulting 187 Rfam sequences is 247 nt, and the combined data set has
an average sequence length of 778 nt (Table [2]). See Table [3] for a description of
biological class and sequence extraction details, and Table [4] for alignment quality metrics.

For a given multiple-sequence alignment, the covariation is defined as: 

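\[
\text{covariation} := \frac{\displaystyle\sum_{a=1,\,b=1,\,a<b}^{M} \; \sum_{S_{ij}} \Big( \Pi^{ab}_{ij}\, H(a_i a_j, b_i b_j) - \Omega^{ab}_{ij}\, H(a_i a_j, b_i b_j) \Big)}{\ldots} \tag{2}
\]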

where 

• $S_{ij}$ is the set of base pairs $i$ and $j$ in the consensus secondary structure.

• $M$ is the number of sequences in the alignment.

• $H(a_i a_j, b_i b_j)$ is the Hamming distance between the strings $a_i a_j$ and $b_i b_j$.

• $\Pi^{ab}_{ij}$ is an indicator function such that if $a_i$ and $a_j$ can form a canonical
base pair, and $b_i$ and $b_j$ can also form a canonical base pair, $\Pi^{ab}_{ij} = 1$
(otherwise $\Pi^{ab}_{ij} = 0$).

• $\Omega^{ab}_{ij}$ is an indicator function such that if $a_i$ and $a_j$ and/or $b_i$ and
$b_j$ cannot form a canonical base pair, $\Omega^{ab}_{ij} = 1$ (otherwise $\Omega^{ab}_{ij} = 0$).

(2) Definition of performance metrics Structure prediction accuracy is measured on a 
base pair level. True positives (TP) are correctly predicted base pairs. False positives (FP) are 
incorrectly predicted base pairs that are not part of the reference structure. True negatives (TN) 
are hypothetical base pairs that are not predicted, nor part of the reference structure. False 
negatives (FN) are reference base pairs missed by the prediction. Performance metrics for true 
positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), and Matthews
correlation coefficient (MCC) are defined as follows:

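\[ \mathrm{TPR} = 100 \cdot \frac{TP}{TP + FN} \tag{3} \]

\[ \Delta\mathrm{TPR} = \mathrm{TPR}_{\mathrm{CoFold}} - \mathrm{TPR}_{\mathrm{RNAfold}} \tag{4} \]

\[ \mathrm{FPR} = 100 \cdot \frac{FP}{FP + TN} \tag{5} \]

\[ \Delta\mathrm{FPR} = \mathrm{FPR}_{\mathrm{CoFold}} - \mathrm{FPR}_{\mathrm{RNAfold}} \tag{6} \]

\[ \mathrm{PPV} = 100 \cdot \frac{TP}{TP + FP} \tag{7} \]

\[ \Delta\mathrm{PPV} = \mathrm{PPV}_{\mathrm{CoFold}} - \mathrm{PPV}_{\mathrm{RNAfold}} \tag{8} \]

\[ \mathrm{MCC} = 100 \cdot \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \tag{9} \]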
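
The four metrics translate directly into code. The following sketch computes them from base-pair counts; the function name `base_pair_metrics` and the example counts are illustrative and not taken from the data sets, and the code assumes that none of the denominators is zero.

```python
import math


def base_pair_metrics(tp, fp, tn, fn):
    """TPR, FPR, PPV and MCC (all in percent) from base-pair counts.

    Assumes every denominator is non-zero; no special-casing is attempted.
    """
    tpr = 100.0 * tp / (tp + fn)
    fpr = 100.0 * fp / (fp + tn)
    ppv = 100.0 * tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = 100.0 * (tp * tn - fp * fn) / denom
    return {"TPR": tpr, "FPR": fpr, "PPV": ppv, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to make the arithmetic easy to follow.
    print(base_pair_metrics(tp=8, fp=2, tn=88, fn=2))
```

Because the number of hypothetical base pairs that are neither predicted nor part of the reference (TN) grows roughly quadratically with sequence length, the FPR is very small for long sequences, which matches the small FPR values in Table 1.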



(3) Incorporating co-transcriptional folding into the prediction algorithm of CoFold
The Nussinov algorithm was one of the first attempts at RNA secondary structure prediction.
It is a dynamic programming method that efficiently calculates the secondary structure with
the largest number of base pairs in O(L^3) time, where L denotes the length of the input
sequence. The algorithm solves the problem recursively by determining the optimal structure
for subsequences, and using these solutions to derive optimal structures for successively
larger subsequences. The output structure is the optimal solution for the full sequence. This
algorithm, however, has several shortcomings. First, base pairs vary in stability; for
example, G-C pairs are energetically more favourable than A-U pairs, whereas the Nussinov
algorithm weights all pairs equally. Second, the stability of a base pair depends strongly on
its neighbouring base pairs due to so-called stacking interactions between adjacent pairs, and
this contextual effect is ignored by the algorithm.
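
The recursion described above can be written compactly. The sketch below is a generic textbook variant of base-pair maximisation and returns only the maximum number of pairs (no backtracking). The function name and the `min_loop` parameter (minimum number of unpaired nucleotides enclosed by a pair, set to 3 here) are assumptions for illustration and are not specified in the text.

```python
# Canonical pairs considered by the recursion: G-C, A-U and G-U (either order).
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested canonical base pairs in `seq`.

    Cubic-time dynamic programme: N[i][j] is the optimum for seq[i..j],
    built up from shorter to longer subsequences.
    """
    n = len(seq)
    if n == 0:
        return 0
    N = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = N[i][j - 1]                      # case: position j is unpaired
            for k in range(i, j - min_loop):        # case: j pairs with some k
                if (seq[k], seq[j]) in CANONICAL:
                    left = N[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return N[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAACCC"))
```

Every pair contributes exactly one unit to the score, regardless of its type or its neighbours, which is precisely the pair of shortcomings noted above.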

The Zuker-Stiegler algorithm [3] is an advancement of the Nussinov algorithm. Rather than
predicting the structure with the greatest number of pairs, the Zuker-Stiegler algorithm predicts 
the most thermodynamically favourable (and pseudo-knot free) RNA structure according to a 
set of free energy parameters. This structure is also called the minimum-free-energy (MFE) 
structure. The algorithm assigns a sequence-specific free energy value to various structural 
building blocks, such as stacking interactions between pairs of adjacent base pairs, unpaired 
nucleotides, and hairpin loops. The algorithm utilises dynamic programming similarly to the 
Nussinov algorithm, but calculates two energy values for all subsequences Sij of a given input 
sequence S, where 1 <= i < j <= L: 

• $C_{ij}$: minimum free energy of subsequence $S_{ij}$ given nucleotides $i$ and $j$ form a base pair

• $FML_{ij}$: minimum free energy of subsequence $S_{ij}$

\[
C_{ij} = \min
\begin{cases}
\mathrm{hairpin}_{ij} & \text{open a helix with hairpin loop} \\
\min_{i < p < q < j} \{ C_{pq} + \mathrm{stack}_{ij,pq} \} & \text{stack, bulge or internal loop} \\
\min_{k} \{ FML_{\ldots} + \mathrm{dangle} \} & \text{open a helix with nested substructure}
\end{cases}
\tag{10}
\]

\[
FML_{ij} = \min
\begin{cases}
\ldots & \text{branched structures} \\
\ldots & \text{close a helix} \\
\ldots & \text{unpaired nucleotide} \\
\ldots & \text{unpaired nucleotide}
\end{cases}
\tag{11}
\]

$C_{ij}$ and $FML_{ij}$ are calculated for each subsequence $S_{ij}$ as the minimum of a
well-defined set of rules (Equation (10), Equation (11)). The minimum free energy can be
retrieved from the value at $FML_{1,L}$, where L denotes the length of the input sequence. The
corresponding MFE structure is retrieved by backtracking through the $C_{ij}$ and $FML_{ij}$
matrices.
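
For orientation, a simplified textbook form of such a pair of recursions is sketched below. It is an assumption made for illustration and not the exact rule set of Equations (10) and (11): $\mathcal{H}(i,j)$ denotes a hairpin loop energy, $\mathcal{I}(i,j;p,q)$ the energy of a stack, bulge or internal loop closed by the pairs $(i,j)$ and $(p,q)$, and $a$, $b$, $c$ are hypothetical constant penalties for closing a multiloop, for a helix inside it, and for an unpaired nucleotide, respectively. Dangling-end contributions are omitted.

```latex
\[
C_{ij} = \min
\begin{cases}
\mathcal{H}(i,j) \\
\min_{i < p < q < j} \{ C_{pq} + \mathcal{I}(i,j;p,q) \} \\
\min_{i < k < j-1} \{ FML_{i+1,k} + FML_{k+1,j-1} \} + a
\end{cases}
\qquad
FML_{ij} = \min
\begin{cases}
\min_{i \le k < j} \{ FML_{i,k} + FML_{k+1,j} \} \\
C_{ij} + b \\
FML_{i+1,j} + c \\
FML_{i,j-1} + c
\end{cases}
\]
```

The four cases of the second recursion correspond, in order, to a branched structure, closing a helix, and an unpaired nucleotide at either end of the subsequence.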

The Zuker-Stiegler algorithm requires a large set of thermodynamic parameters. In 1999,
the Turner group published one such model, which included a combination of experimentally
measured energies and estimated values. This parameter set (called the Turner 1999 parameter
set) is widely used by many state-of-the-art tools, including RNAfold and Mfold. Andronescu
et al. improved the estimated values in the Turner 1999 parameter set by applying
sophisticated machine learning techniques to train 363 free parameter values. These parameters
were adjusted using a training set of 3439 reference structures and 946 thermodynamic
measurements by optical melting, resulting in the Andronescu 2007 parameter set. They observed
an average performance increase of 7% on a test set of 1660 sequences containing several
biological classes, including tRNA, RNase P, rRNA and SRP RNA.

The Zuker-Stiegler algorithm traditionally considers only the change in free energy for a
given RNA secondary structure conformation in thermodynamic equilibrium, but does not consider
the process of RNA structure formation, i.e. how the RNA sequence arrives at the MFE
structure. Rather, the Zuker-Stiegler algorithm implicitly assumes that the input RNA sequence
(1) is already fully synthesised, (2) is in thermodynamic equilibrium and (3) will always be
able to reach the RNA structure that minimises the overall free energy of the molecule. We
know from a range of experiments, however, that RNA molecules start to fold while they emerge
during transcription, that they are not necessarily in thermodynamic equilibrium during
structure formation in vivo and that they may get trapped on their kinetic folding pathway.
That RNA molecules overall proceed towards the MFE structure over time is only an
approximation of the complex reality in vivo. As the molecule emerges from the polymerase,
local structures immediately begin to form. Formation of long-range base pairs may require
disruption of these local structures, and their folding rate may be prohibitively slow due to
high energy barriers. That is, the molecule may never reach the minimum free energy structure
due to kinetic considerations. The structure formation in vivo may get further complicated
due to trans interactions between the RNA sequence and other molecules in the living cell,
which we ignore for now.

We propose a new method for RNA secondary structure prediction, CoFold, that takes into
account some effects of co-transcriptional folding. The key effect that we aim to model is that
during co-transcriptional folding in vivo, it does matter to a given sequence position whether a
potential pairing partner is available for base-pairing or not. To capture this, we model the distance
along the sequence between base-pairing sequence positions. CoFold is a modification of the
Zuker-Stiegler algorithm, and it was implemented using the RNAfold source code from the
ViennaRNA package.

CoFold calculates energies in the same fashion as RNAfold, but all energy contributions
associated with a base pair are modified by a scaling function according to the number of
nucleotides between the pair (i.e. the distance d). This scaling function γ(d) models the
exponential decay in reachability as a function of the nucleotide distance d between the two
potential pairing partners and depends on two parameters α and τ (Equation (12), Figure 6). Both
parameters have a straightforward interpretation. The value of α specifies the range of the scaling
function (e.g. when α is 0.2, the affected free energies will range from 80% to 100% of their
original values). The value of τ determines the rate of the exponential decay, where low values
of τ result in a steep decay function.

\[ \gamma(d) := \alpha \cdot \left( e^{-d/\tau} - 1 \right) + 1 \tag{12} \]
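As a small worked illustration of the scaling function in Equation (12), the following sketch evaluates it directly. The function names and the use of the default values α = 0.5 and τ = 640 as keyword defaults are choices made for this example; they are not taken from the CoFold source code.

```python
import math


def gamma(d, alpha=0.5, tau=640.0):
    """Scaling factor for a base pair whose partners are d nucleotides apart.

    Equals 1 at d = 0 and decays exponentially towards 1 - alpha for large d;
    smaller tau gives a steeper decay.
    """
    return alpha * (math.exp(-d / tau) - 1.0) + 1.0


def scale_energy(energy, d, alpha=0.5, tau=640.0):
    """Scale one energy contribution (positive or negative) by gamma(d)."""
    return gamma(d, alpha, tau) * energy
```

For α = 0.2 the factor stays between 0.8 and 1.0, which matches the 80% to 100% range quoted above. Because positive and negative contributions are multiplied by the same factor, their relative magnitudes at a given distance are unchanged.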

The scaling function γ(d) is only used in conjunction with energy values in the Cij calculation
because these correspond to predicted base pairs. The function is not applied to the energy of
subsequences in order to avoid multiple applications to the same value. The function is applied
both to elements with positive energy, such as loops and bulges, and to those with negative
energy, such as stacking interactions. This is necessary to preserve the relative magnitude of
the contributions from structural components, see Equation (13) and Figure 7 for the modified
Cij calculation. The FMLij calculation remains the same as in RNAfold.



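\[
C_{i,j} = \min
\begin{cases}
\gamma(d_{i,j}) \cdot \mathrm{hairpin}_{i,j} & \text{open a helix with hairpin loop} \\
\min_{i<p<q<j} \{ C_{p,q} + \gamma(d_{i,j}) \cdot \mathrm{Stack}_{i,j,p,q} \} & \text{stack, bulge or internal loop} \\
\min_{k \in \{1,2\}} \{ FML_{i+k,\,j-1} + \gamma(d_{i,j}) \cdot \mathrm{dangle} \} & \text{open a helix with nested substructure}
\end{cases}
\tag{13}
\]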
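To make explicit where the scaling factor enters Equation (13), the sketch below computes one Cij entry from already-known candidate values. It assumes that the energies of the enclosed subsequences (the Cpq and FML terms) are supplied by the caller and are left unscaled, while the contributions of the new base pair (hairpin, stack and dangle) are multiplied by γ(d). The function name `scaled_c_ij` and its argument layout are hypothetical and do not reflect the RNAfold or CoFold code.

```python
import math


def gamma(d, alpha=0.5, tau=640.0):
    """Scaling function of Equation (12)."""
    return alpha * (math.exp(-d / tau) - 1.0) + 1.0


def scaled_c_ij(d, hairpin, interior, multiloop, alpha=0.5, tau=640.0):
    """Minimum over the three cases of the modified C recursion.

    d         -- distance between the pairing positions i and j
    hairpin   -- hairpin loop energy for closing pair (i, j)
    interior  -- list of (C_pq, stack_energy) candidates
    multiloop -- list of (FML_energy, dangle_energy) candidates
    """
    g = gamma(d, alpha, tau)
    candidates = [g * hairpin]
    # Subsequence energies (C_pq, FML) are NOT rescaled, so each value is
    # scaled exactly once, at the step where its base pair is formed.
    candidates += [c_pq + g * stack for c_pq, stack in interior]
    candidates += [fml + g * dangle for fml, dangle in multiloop]
    return min(candidates)
```

With d = 0 the factor is 1 and the function reduces to the unscaled minimum of Equation (10); for very distant partners the pair-specific terms are weighted by 1 - α.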

The output of CoFold is an RNA secondary structure which promotes base pairs according to the
above scaling function. The predicted RNA secondary structure therefore captures both
thermodynamic contributions and effects due to co-transcriptional structure formation. Like
RNAfold, CoFold allows the user to select a thermodynamic parameter set, and the running time
of CoFold also scales with O(L^3), where L denotes the length of the input sequence. For
performance evaluation, we use both the Turner 1999 (CoFold) and the Andronescu 2007
(CoFold-A) parameter sets introduced above.

(4) Parameter training. CoFold has two free parameters: α and τ. Due to the small
number of parameters, they were trained using a simple brute-force scheme. CoFold was run
on all sequences of the combined data set and performance metrics were calculated for each (α, τ)
combination in set P (defined in Equation (14)). The Turner 1999 thermodynamic parameter
set was used for (α, τ) parameter training.

\[ P := \{0.05, 0.10, \dots, 0.90, 0.95\} \times \{40, 80, \dots, 1160, 1200\} \tag{14} \]



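\[ \mathrm{MCC}^{S}_{\alpha,\tau} := \frac{1}{|S|} \sum_{s \in S} \mathrm{MCC}^{s}_{\alpha,\tau} \tag{15} \]
where (α, τ) ∈ P and S is the sequence set

\[ \Delta\mathrm{MCC}^{S}_{\alpha,\tau} := \mathrm{MCC}^{S}_{\alpha,\tau} - \mathrm{MCC}^{S}_{\mathrm{RNAfold}} \tag{16} \]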
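The brute-force scheme over the parameter grid of Equation (14) can be sketched as follows. The sketch assumes that the performance of a parameter pair is the mean per-sequence MCC and that the improvement is measured against the mean RNAfold MCC on the same sequences. The callback `evaluate(alpha, tau)`, which would run the predictor and return per-sequence MCC values, is hypothetical and stands in for the actual CoFold runs.

```python
from itertools import product
from statistics import mean

# Parameter grid P: 19 alpha values times 30 tau values = 570 combinations.
ALPHAS = [round(0.05 * i, 2) for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
TAUS = list(range(40, 1201, 40))                      # 40, 80, ..., 1200
P = list(product(ALPHAS, TAUS))


def delta_mcc(mcc_method, mcc_rnafold):
    """Mean MCC of a method minus mean MCC of RNAfold on the same sequences."""
    return mean(mcc_method) - mean(mcc_rnafold)


def train(evaluate, mcc_rnafold):
    """Evaluate every (alpha, tau) in P and return the best pair and the table.

    evaluate(alpha, tau) -> list of per-sequence MCC values (hypothetical hook).
    """
    table = {(a, t): delta_mcc(evaluate(a, t), mcc_rnafold) for a, t in P}
    best = max(table, key=table.get)
    return best, table
```

An exhaustive search is feasible here only because there are two parameters; the returned table is the kind of performance matrix that the regression and cross-validation steps operate on.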

Performance metrics were found to be highly correlated in α and τ (Figure 8 (right),
Figure 9). To demonstrate this, linear regression was performed on the ΔMCC matrix
(Figure 8 (left)). We first compiled a set of triples Q = {(α, τ, ΔMCC_{α,τ})}, for which
ΔMCC_{α,τ} lies in the 97% quantile of the performance matrix. Weighted linear regression was
performed with α and τ as dimensions, and ΔMCC as the weight. The regression line fits the data
with an R² value of 98.4%, indicating that variability in τ largely accounts for the variability
in α. The regression line (solid) and its 95% confidence region (dotted) are plotted in
Figure 8 (left).
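A weighted straight-line fit of this kind can be written in closed form. The sketch below regresses α on τ with one non-negative weight per point, which is one plausible reading of the procedure described above; the exact regression setup used by the authors is not specified beyond the text, and the function name is illustrative.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = slope * x + intercept.

    xs, ys -- coordinates (here: tau and alpha values)
    ws     -- non-negative weights (here: the delta-MCC of each pair)
    """
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw   # weighted mean of x
    my = sum(w * y for w, y in zip(ws, ys)) / sw   # weighted mean of y
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept
```

For orientation, the fit reported in the caption of Figure 8 (slope 6.1 · 10^-4, intercept 0.105) gives α ≈ 0.495 at τ = 640, which is consistent with the default parameter pair lying on the regression line.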

Twenty trials of five-fold cross-validation were performed to determine the robustness of
parameter training. In each trial, the combined data set D is randomly divided into five partitions
D_i. For each partition, the optimal parameter combination is determined for the remaining
sequences (Equation (17)). The cross-validation results are plotted in Figure 8 (right), where
the integer in each cell indicates the number of trials where that parameter combination was
optimal. The optimal parameter values cluster closely around the linear regression line shown in
Figure 8 (left).

\[ (\alpha_{opt}, \tau_{opt}) := (\alpha, \tau) \ \text{s.t.}\ \Delta\mathrm{MCC}^{T}_{\alpha,\tau} = \max(\Delta\mathrm{MCC}^{T}) \tag{17} \]
where T := D \ D_i
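The repeated five-fold procedure can be sketched as below. The sketch assumes that a per-sequence ΔMCC value has already been computed for every parameter pair, stored in a nested dictionary `delta[(alpha, tau)][seq_id]`; this data layout, the function name and the fixed random seed are assumptions for illustration.

```python
import random
from collections import Counter


def cross_validate(delta, seq_ids, n_trials=20, n_folds=5, seed=0):
    """Count how often each parameter pair is optimal on a training split.

    delta   -- dict: (alpha, tau) -> dict: seq_id -> per-sequence delta-MCC
    seq_ids -- identifiers of all sequences in the data set D (non-empty)
    """
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_trials):
        ids = list(seq_ids)
        rng.shuffle(ids)
        folds = [ids[k::n_folds] for k in range(n_folds)]  # partitions D_i
        for held_out in folds:
            held = set(held_out)
            training = [s for s in ids if s not in held]   # T = D \ D_i
            best = max(
                delta,
                key=lambda p: sum(delta[p][s] for s in training) / len(training),
            )
            wins[best] += 1
    return wins
```

Each trial contributes five winners, so twenty trials yield one hundred counts in total; these counts correspond to the integers shown per cell in the right panel of Figure 8.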

The default parameter combination for CoFold is α = 0.5, τ = 640. This parameter set
maximises MCC for the combined data set. The default parameter combination is marked with
an "X" in Figure 8 (left), which shows that it lies directly on the linear regression line.

(5) Calculation of free energy differences. We define ΔΔG as the difference between the
free energy (ΔG) of a given prediction and the corresponding RNAfold prediction. We calculate
these values for RNAfold-A, CoFold and CoFold-A. Because the Andronescu 2007
parameters use modified free energy values, we use RNAeval from the ViennaRNA package
to calculate the free energy of each predicted structure on an equal footing. Unlike
RNAfold, which predicts a minimum free energy structure from a sequence, RNAeval calculates
the free energy for an input RNA structure according to the provided thermodynamic
parameters. For consistency, we use the Turner 1999 thermodynamic parameters for all
ΔΔG calculations. For a prediction program X, which corresponds to RNAfold-A, CoFold or
CoFold-A, we define ΔΔG_X and %ΔΔG_X as follows.

\[ \Delta\Delta G_X = \Delta G_X - \Delta G_{\mathrm{RNAfold}} \tag{18} \]



(19) 
Tables

                        long data set    combined data set
clade                   > 1000 nt        all       < 1000 nt
Bacteria                15               69        (54)
Eukaryotes              15               112       (97)
Virus                   -                20        (20)
Archaea                 17               33        (16)
Chloroplast             14               14        (0)
sum                     61               248       (187)

sequence length (nt)
average                 2397             776       (247)
minimum                 1245             110       (110)
maximum                 3578             3578      (628)


Table 2. Evolutionary composition and length statistics for the long and the 
combined data set. Numbers in brackets specify the respective numbers for the short 
sequences in the combined data set. 
biological class               A.len   clade   N.seq   N.ext   max ppid   source   ID
                               (nt)                            (%)
16S rRNA (archaea)             1545    A       40      8       85         CRW      16S archaea
23S rRNA (archaea)             3153    A       40      9       85         CRW      23S archaea
16S rRNA (bacteria)            1661    B       144     7       70         CRW      16S bacteria
23S rRNA (bacteria)            3046    B       40      8       85         CRW      23S bacteria
16S rRNA (chloroplast)         1558    C       40      5       85         CRW      16S chloroplast
23S rRNA (chloroplast)         3722    C       40      9       80         CRW      23S chloroplast
16S rRNA (eukaryote)           1867    E       40      7       85         CRW      16S eukaryote
23S rRNA (eukaryote)           4105    E       40      8       85         CRW      23S eukaryote
snRNA                          184     E       87      14      80         Rfam     RF00003
U2 spliceosomal RNA            270     E       181     10      50         Rfam     RF00004
Nuclear RNase P                622     E       77      11      45         Rfam     RF00009
snoRNA                         236     E       14      9       85         Rfam     RF01256
snoRNA                         394     E       4       1       85         Rfam     RF01267
snoRNA                         373     E       18      9       85         Rfam     RF01296
U4 spliceosomal RNA            273     E       160     11      50         Rfam     RF00015
U5 spliceosomal RNA            178     E       153     9       45         Rfam     RF00020
ciliate telomerase RNA comp.   270     E       19      11      80         Rfam     RF00025
RNase MRP                      903     E       40      12      50         Rfam     RF00030
RNase P                        511     B       88      8       60         Rfam     RF00011
CsrB RNA                       391     B       11      7       85         Rfam     RF00018
lysine riboswitch              232     B       37      14      65         Rfam     RF00168
Mg riboswitch - Ykok leader    216     B       85      14      65         Rfam     RF00380
Ornate extremophilic RNA       676     B       7       6       85         Rfam     RF01071
Pestivirus IRES                286     V       23      5       85         Rfam     RF00209
Tombus virus 5' UTR            180     V       7       5       85         Rfam     RF00171
Aphthovirus IRES               471     V       87      4       85         Rfam     RF00210
Cripavirus IRES                208     V       6       6       80         Rfam     RF00458
tRNA-like structures           137     V       5       5       80         Rfam     RF01084
Archaeal RNase P               533     A       25      16      80         Rfam     RF00373


Table 3. RNA families of the long and the combined data set. All sequences of the
long data set derive from alignments of the CRW data base (top), whereas the short sequences
from the combined data set all derive from alignments of the Rfam data base (bottom). For
each original alignment from either data base, i.e. each row in this table, we specify the
alignment length in nucleotides (A.len), the evolutionary origin of its sequences (clade, A -
Archaea, B - Bacteria, C - Chloroplast, V - Virus, E - Eukaryotes), the number of sequences
(N.seq), data base (source) and identifier in that data base (ID). We also specify, for each
original alignment, how many sequences we extracted (N.ext) and what their maximum
pairwise percent identity is in terms of primary sequence conservation (max. ppid). IRES
stands for internal ribosomal entry site.
alignment (ID)     A.len   av. seq.      av. ppid   gaps   n           bpairs   canonical    covar.
                   (nt)    length (nt)   (%)        (%)    (%)                  bpairs (%)
16S archaea        1545    1477          81.8       4.4    5 · 10^-7   458      95.2         0.343
23S archaea        3153    2945          74.9       6.6    6 · 10^-7   852      95.0         0.408
16S bacteria       1661    1520          76.7       8.5    2 · 10^-2   453      93.4         0.352
23S bacteria       3046    2904          79.2       4.6    6 · 10^-7   868      94.3         0.358
16S chloroplast    1558    1490          90.2       4.4    5 · 10^-7   440      93.9         0.113
23S chloroplast    3722    2941          74.8       21.0   -           869      90.1         0.253
16S eukaryote      1867    1708          73.3       8.5    1 · 10^-7   370      84.3         0.162
23S eukaryote      4105    3476          78.7       15.3   1 · 10^-7   998      88.1         0.084
RF00003            184     162           63.1       11.8   -           40       93.2         0.493
RF00004            270     193           58.4       28.4   1 · 10^-2   45       92.8         0.496
RF00009            622     315           40.7       49.3   8 · 10^-5   62       89.3         0.397
RF01256            236     208           60.6       11.9   -           54       92.7         0.457
RF01267            394     384           72.5       2.6    -           128      94.3         0.295
RF01296            373     325           63.7       13.0   -           57       91.4         0.339
RF00015            273     147           52.2       46.2   5 · 10^-5   31       91.5         0.604
RF00020            178     117           51.7       34.1   4 · 10^-5   30       94.0         0.694
RF00025            270     186           42.5       31.1   -           39       86.4         0.395
RF00030            903     303           34.7       66.5   -           74       88.3         0.470
RF00011            511     373           63.0       27.1   -           105      95.1         0.500
RF00018            391     350           62.4       10.5   -           49       96.8         0.368
RF00168            232     183           46.1       21.2   -           53       90.3         0.580
RF00380            216     170           59.6       21.4   -           47       94.5         0.471
RF01071            676     609           59.9       9.9    -           159      90.2         0.378
RF00209            286     275           89.2       3.9    -           75       98.8         0.191
RF00171            180     159           67.3       11.4   -           34       97.9         0.403
RF00210            471     461           85.4       2.1    -           122      98.3         0.181
RF00458            208     201           55.4       3.5    -           60       95.0         0.757
RF01084            137     128           51.0       6.9    -           43       97.2         0.795
RF00373            533     311           49.4       41.7   -           87       90.0         0.537


Table 4. Alignment quality and phylogenetic support for the reference RNA secondary
structures. For each original alignment, i.e. each row in this table, we specify the alignment
length in nucleotides (A.len), the average length of each non-gapped sequence in that alignment
(av. seq. length), the average pairwise percent identity between pairs of sequences in the
alignment in terms of primary sequence conservation (av. ppid), the average fraction of gaps per
sequence in the alignment (gaps), the average fraction of ambiguous (not A,C,G,T,U,-) nucleotide
symbols per sequence in the alignment (n), the number of base pairs in the reference RNA
secondary structure for that alignment (bpairs), the average fraction of sequences in the
alignment that have a consensus base pair per conserved base pair of the reference secondary
structure (canonical bpairs) and the covariation (covar.) as defined in Rfam, which measures how
well the base pairs of the reference RNA secondary structure are supported by covariation (high
means good).
long data set
              TPR (%)   FPR (%)   PPV (%)   MCC (%)
RNAfold       46.30     0.0176    39.74     42.81
RNAfold-A     52.02     0.0160    44.76     48.17
CoFold        52.83     0.0159    45.79     49.10
CoFold-A      57.80     0.0145    50.06     53.70

combined data set
              TPR (%)   FPR (%)   PPV (%)   MCC (%)
RNAfold       57.87     0.1132    45.27     50.86
RNAfold-A     58.98     0.1152    46.16     51.83
CoFold        60.38     0.1097    47.56     53.26
CoFold-A      61.51     0.1112    48.42     54.22

short sequences only
              TPR (%)   FPR (%)   PPV (%)   MCC (%)
RNAfold       61.64     0.1444    47.08     53.48
RNAfold-A     61.25     0.1475    46.61     53.02
CoFold        62.84     0.1403    48.14     54.62
CoFold-A      62.72     0.1428    47.88     54.39



Table 5. CoFold predictive power for base pairs for all data sets. The performance
accuracy of CoFold and CoFold-A compared to RNAfold and RNAfold-A for the test
set as measured in terms of true positive rate (TPR = 100 · TP/(TP + FN)), false positive
rate (FPR = 100 · FP/(FP + TN)), positive predictive value (PPV = 100 · TP/(TP + FP))
and Matthews correlation coefficient
(MCC = 100 · (TP · TN - FP · FN)/√((TP + FP) · (TP + FN) · (TN + FP) · (TN + FN))),
where TP denotes the number of true positives, TN the true negatives, FP the false positives
and FN the false negatives.
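The four measures in the caption of Table 5 can be computed from the base-pair confusion counts as in the following sketch. The function name is illustrative, and returning an MCC of 0 when the denominator vanishes is a common convention assumed here rather than something stated in the text.

```python
import math


def base_pair_metrics(tp, tn, fp, fn):
    """Return TPR, FPR, PPV and MCC (all in percent) from confusion counts."""
    tpr = 100.0 * tp / (tp + fn)   # true positive rate (sensitivity)
    fpr = 100.0 * fp / (fp + tn)   # false positive rate
    ppv = 100.0 * tp / (tp + fp)   # positive predictive value
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is set to 0 if any marginal count is zero.
    mcc = 100.0 * (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"TPR": tpr, "FPR": fpr, "PPV": ppv, "MCC": mcc}
```

Because almost all possible base pairs of a long sequence are true negatives, the FPR is very small in absolute terms, which is why the FPR values in Table 5 are far below the other measures.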

Summary of %ΔΔG distributions

long data set (> 1000 nt)
              av. ± stdev   min     max
RNAfold-A     4.7 ± 1.9     1.4     11.1
CoFold        1.8 ± 1.0     0.2     4.4
CoFold-A      6.8 ± 2.4     1.7     13.1

combined data set (all)
              av. ± stdev   min     max
RNAfold-A     5.0 ± 3.5     -2.3    15.4
CoFold        0.5 ± 1.2     -5.0    4.4
CoFold-A      5.9 ± 3.8     -2.3    18.2

combined data set (< 1000 nt)
              av. ± stdev   min     max
RNAfold-A     5.1 ± 3.9     -2.3    15.4
CoFold        0.1 ± 0.9     -5.0    3.8
CoFold-A      5.6 ± 4.1     -2.3    18.2



Table 6. Summary of relative free energy difference distributions of predicted
structures w.r.t. the MFE structures predicted by RNAfold for the same input
sequences, for all data sets.
Linear fit to ΔMCC versus %ΔΔG distributions

long data set (> 1000 nt)
              intercept ± stdev   slope ± stdev    R² (%)
RNAfold-A     7.0 ± 2.4           -0.34 ± 0.48     0.85
CoFold        3.5 ± 1.6           1.52 ± 0.78      6.06
CoFold-A      9.2 ± 3.1           0.25 ± 0.43      0.56

combined data set
              intercept ± stdev   slope ± stdev    R² (%)
RNAfold-A     1.0 ± 1.4           0.0008 ± 0.23    5.6 · 10^-6
CoFold        2.1 ± 0.6           0.59 ± 0.47      0.64
CoFold-A      2.1 ± 1.6           0.21 ± 0.23      0.34

short sequences only (< 1000 nt)
              intercept ± stdev   slope ± stdev    R² (%)
RNAfold-A     -0.8 ± 1.6          0.06 ± 0.25      0.03
CoFold        1.3 ± 0.7           -2.21 ± 0.75     4.44
CoFold-A      0.7 ± 1.7           0.03 ± 0.25      0.01


Table 7. Details of the linear fits to the ΔMCC versus %ΔΔG distributions.
Figures

Figure 5. Changes in prediction accuracy for the structures predicted by CoFold
for all sequences of the long data set. We report the prediction accuracy for base pairs in
terms of relative changes in true positive rate (TPR = 100 · TP/(TP + FN)) and false positive
rate (FPR = 100 · FP/(FP + TN)) by comparing the prediction accuracy of the structures
predicted by CoFold to those predicted by RNAfold. TP denotes the number of true
positives, TN the true negatives, FP the false positives and FN the false negatives.




Figure 7. Details of the sequence coordinates affected by the scaling function of
CoFold in Equation (13).
Figure 8. Training of parameters in CoFold: linear fit and robustness. Left figure,
heatmap showing the average MCC differences w.r.t. RNAfold as a function of the τ (x-axis)
and α (y-axis) parameter values. The average MCC differences are indicated via the colours
from high (bright yellow) to low (dark red), see Figure 9 for details. The solid line corresponds
to the linear regression line (α = a · τ + b with a slope of a = 6.1 · 10^-4 ± 2 · 10^-5 and an
intercept of b = 0.105 ± 0.016). The two dotted lines delineate the 95% confidence region. The
asterisk shows the parameter pair with the highest average MCC (α = 0.50 and τ = 640), which is
the parameter combination used in CoFold and CoFold-A. Right figure, same heatmap as in
the left figure, but this time showing the number of trials, out of 20 trials of five-fold
cross-validation, in which the corresponding pair of parameter values has the highest average
MCC for the set of training sequences.
Figure 9. Colour coding for the heatmaps for parameter training of CoFold.
Histogram of the average MCC differences of CoFold w.r.t. RNAfold, ranging from light
yellow for the largest improvement of performance accuracy to dark red for a small
performance improvement. A grey colour in Figure 8 corresponds to an accuracy improvement
that is smaller than the range covered in this histogram.
Figure 10. Relative free energy difference distributions of the predicted structures
w.r.t. the MFE structures predicted by RNAfold for the same input sequences,
for all data sets. Results for the long data set (left column), the combined data set (middle
column) and the short sequences of the combined data set (right column). For each data set,
three histograms show the relative free energy differences of the RNA structures predicted by
RNAfold-A w.r.t. the MFE structures predicted by RNAfold for the same sequence (top
row), of the RNA structures predicted by CoFold w.r.t. the MFE structures predicted by
RNAfold (middle row) and of the RNA structures predicted by CoFold-A w.r.t. the
MFE structures predicted by RNAfold-A (bottom row). The free energies of all structures
are calculated using the Turner 1999 energy parameters.
Figure 11. Differences in prediction accuracy versus relative free energy changes
of the predicted structures w.r.t. the MFE structures predicted by RNAfold for
the same input sequences, for all data sets. Results for the long data set (left column),
the combined data set (middle column) and the short sequences of the combined data set
(right column). For each data set, three figures show the change in performance accuracy in
terms of MCC versus the relative change of free energy for the structures predicted by
RNAfold-A (top row) w.r.t. the RNA structures predicted by RNAfold for the same
sequence, for the structures predicted by CoFold (middle row) w.r.t. the RNA structures
predicted by RNAfold for the same sequence and for the structures predicted by CoFold-A
(bottom row) w.r.t. the RNA structures predicted by RNAfold for the same sequence. The
free energies of all structures are calculated using the Turner 1999 energy parameters.



